The Importance of Window Length in Splice Site Prediction

Authors

  • Leila Taher
  • Burkhard Morgenstern
  • Peter Meinicke
Abstract

The performance of gene prediction programs depends strongly on the methods they use to locate splice sites. Various pattern recognition techniques are available to assess the quality of candidate splice sites; see [1] for an overview and further references. All of these techniques compute a score derived from the distribution of nucleotides in the neighbourhood of a splice site consensus sequence. These scores are normally obtained with splice site models that have been estimated from large training sets of example neighbourhoods. The training sets may also include negative examples, i.e. sequences that contain the consensus sequence but are not actual splice sites. Unfortunately, the concept of ‘neighbourhood’ is rather ambiguous, and there is no general recommendation about which nucleotide positions should be included in the calculation, i.e. which analysis window should be employed. In principle, the window length is an important parameter because it determines the amount of information that has to be evaluated. On the one hand, the window should be long enough to capture as much detail as possible about the patterns. On the other hand, it should be short enough to include only the relevant information, which improves generalization. In the present study, we investigate how splice site prediction accuracy depends on the window size and shape, using support vector machines (SVMs) [2]. Our results show that the choice of window is crucial for splice site prediction, and we therefore suggest that the window length be treated as an essential parameter of the model.
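To make the notion of an analysis window concrete, the following is a minimal sketch of how windows of configurable length and shape might be cut around candidate sites and turned into numeric features for a classifier such as an SVM. The function names `candidate_windows` and `one_hot`, the use of the donor consensus `GT`, and the one-hot encoding are all illustrative assumptions; the abstract does not specify the authors' actual feature construction.

```python
from typing import Iterator, List, Tuple

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols encode as all zeros


def candidate_windows(
    seq: str,
    consensus: str = "GT",   # assumption: donor-site consensus dinucleotide
    upstream: int = 3,       # hypothetical parameter: bases kept before the consensus
    downstream: int = 4,     # hypothetical parameter: bases kept after the consensus
) -> Iterator[Tuple[int, str]]:
    """Yield (position, window) for each consensus occurrence with a full window.

    Occurrences too close to either end of the sequence are skipped, so every
    window has the same length: upstream + len(consensus) + downstream.
    """
    seq = seq.upper()
    start = seq.find(consensus)
    while start != -1:
        left = start - upstream
        right = start + len(consensus) + downstream
        if left >= 0 and right <= len(seq):
            yield start, seq[left:right]
        start = seq.find(consensus, start + 1)


def one_hot(window: str) -> List[int]:
    """Encode a window as a flat 0/1 vector with four entries per position."""
    vec: List[int] = []
    for base in window:
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec
```

Under this encoding the feature vector has 4 × (upstream + 2 + downstream) entries, so widening the window adds context but also adds dimensions that the model has to generalize over. Setting `upstream` and `downstream` to different values gives an asymmetric window, which is one way to vary the window shape as well as its size.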

Similar resources

Pre-mRNA Secondary Structure Prediction Aids Splice Site Prediction

Accurate splice site prediction is a critical component of any computational approach to gene prediction in higher organisms. Existing approaches generally use sequence-based models that capture local dependencies among nucleotides in a small window around the splice site. We present evidence that computationally predicted secondary structure of moderate length pre-mRNA subsequences contains i...

Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA.

Prediction of splice site selection and efficiency from sequence inspection is of fundamental interest (testing the current knowledge of requisite sequence features) and practical importance (genome annotation, design of mutant or transgenic organisms). In plants, the dominant variables affecting splice site selection and efficiency include the degree of matching to the extended splice site con...

Identification of a Novel Splice Site Mutation in RUNX2 Gene in a Family with Rare Autosomal Dominant Cleidocranial Dysplasia

Introduction: Pathogenic variants of RUNX2, a gene that encodes an osteoblast-specific transcription factor, have been shown to cause cleidocranial dysplasia (CCD), a rare hereditary skeletal and dental disorder with a dominant mode of inheritance and a broad range of clinical variability. Due to the relative lack of clinical complications resulting in CCD, the medical diagnosis of this disorder is challen...

Dataset Construction for Gene Structure Prediction and Alternative Splicing Analysis

The performance of gene finding from genome sequences depends strongly on the accuracy of splice site prediction. Recent gene finding programs, however, still do not reach sufficient accuracy. To improve the accuracy of splice site prediction, it is necessary to understand the splicing mechanism and to build a model from clear experimental evidence. For this purpose, genomic full-length precursor mRNA...

Publication date: 2003